
The Annals of Applied Statistics

Institute of Mathematical Statistics

Preprints posted in the last 30 days, ranked by how well they match the content profile of The Annals of Applied Statistics, based on 15 papers previously published here. The average preprint has a 0.00% match score for this journal, so anything above that is already an above-average fit.

1
Identifying Inheritance Patterns of Allelic Imbalance, using Integrative Modeling and Bayesian Inference

Hoyt, S. H.; Reddy, T. E.; Gordan, R.; Allen, A. S.; Majoros, W. H.

2026-03-31 bioinformatics 10.64898/2026.03.28.714974 medRxiv
Top 0.1%
0.7%

Interpreting the effects of novel mutations on phenotypic traits remains challenging, particularly for cis-regulatory variants. For rare variants, individuals typically possess at most one affected copy of the causal allele, leading to allelic imbalance, and thus the ability to infer inheritance of allelic imbalance can inform genetic studies of phenotypic traits. While many methods for detection of allele-specific expression (ASE) exist, they largely focus on ASE in one individual. We show that performing joint inference across multiple individuals in a trio allows for simultaneously improving estimates of ASE and identifying its likely mode of inheritance. Our Bayesian approach has the benefit of being able to (1) aggregate information across individuals so as to improve statistical power, (2) estimate uncertainty in estimates, and (3) rank modes of inheritance by posterior probability. We demonstrate that this model is also applicable to other forms of imbalance such as allele-specific chromatin accessibility. Applying the model to ATAC-seq and RNA-seq from several trios, we uncover examples in which ASE can be linked to imbalance in chromatin state of cis-regulatory elements and to potential causal variants. As the cost of sequencing continues to decrease, we expect that powerful methodologies such as the one presented here will promote more routine collection of samples from related individuals and improve our understanding of genetic effects on gene regulation and their contribution to phenotypic traits.
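
The trio-level joint model is specific to the paper, but the conjugate updating it builds on is standard. Below is a minimal sketch, assuming a Beta-binomial model of allele-specific read counts for a single individual; the function name and priors are illustrative, not the authors' code, and the paper's approach goes further by tying parameters across trio members and ranking inheritance modes by posterior probability.

```python
from scipy import stats

def ase_posterior(alt_reads, total_reads, a0=1.0, b0=1.0):
    """Beta posterior over the allelic fraction, from a Beta(a0, b0) prior
    and a binomial likelihood for the observed read counts."""
    return stats.beta(a0 + alt_reads, b0 + total_reads - alt_reads)

# Example: 70 of 100 reads at a heterozygous site carry the alternate allele.
post = ase_posterior(70, 100)
print(post.mean())            # posterior mean allelic fraction (~0.70)
print(post.interval(0.95))    # 95% credible interval
print(1 - post.cdf(0.5))      # posterior probability of imbalance toward alt
```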

2
Analysis of biological networks using Krylov subspace trajectories

Frost, H. R.

2026-03-31 bioinformatics 10.64898/2026.03.29.715092 medRxiv
Top 0.1%
0.7%

We describe an approach for analyzing biological networks using rows of the Krylov subspace matrix of the adjacency matrix. Specifically, we explore the scenario where the Krylov subspace matrix is computed via power iteration using a non-random and potentially non-uniform initial vector that captures a specific biological state or perturbation. In this case, the rows of the Krylov subspace matrix (i.e., Krylov trajectories) carry important functional information about the network nodes in the biological context represented by the initial vector. We demonstrate the utility of this approach for community detection and perturbation analysis using the C. elegans neural network.
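
To make the construction concrete, here is a minimal sketch (not the paper's code) of building Krylov trajectories by power iteration from a non-uniform initial vector; the toy graph, normalization choice, and grouping step are illustrative assumptions.

```python
import numpy as np

def krylov_trajectories(A, x0, m=10, normalize=True):
    """Krylov matrix whose column j is (a normalized) A^j x0, so row i
    traces node i across power iterates of the initial state x0."""
    n = A.shape[0]
    K = np.empty((n, m))
    v = x0.astype(float)
    for j in range(m):
        if normalize:
            v = v / np.linalg.norm(v)   # keep iterates on a common scale
        K[:, j] = v
        v = A @ v
    return K

# Toy use: a 5-node path graph with a perturbation localized at node 0.
A = np.diag(np.ones(4), 1); A = A + A.T   # path adjacency matrix
x0 = np.zeros(5); x0[0] = 1.0             # non-uniform initial state
K = krylov_trajectories(A, x0, m=6)
# Nodes with similar rows of K (e.g., by cosine similarity) can be grouped,
# one simple route to the community-detection use described above.
```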

3
Omitted familial extrinsic risk inflates inferred intrinsic lifespan heritability

Kornilov, S. A.

2026-04-06 genetics 10.64898/2026.04.02.716222 medRxiv
Top 0.1%
0.7%

Shenhar et al. (2026) report 50% "intrinsic" lifespan heritability after calibrating a one-component correlated-frailty survival model to Scandinavian twin lifespans. Their framework is mathematically coherent, but the intrinsic component is not identified if heritable, mortality-relevant extrinsic susceptibility is omitted at calibration. We show that one-component calibration absorbs omitted familial extrinsic structure into the intrinsic frailty scale parameter σ_θ, and that this variance absorption is visible through four separate diagnostics. (1) Variance absorption. Under misspecification, σ_θ is inflated by +22.1% (95% CI: 21.5-22.7%), corresponding to +49% inflation in the variance σ_θ². Falconer h² is downstream of calibration and inherits a +9.2 pp bias (95% CI: 8.7-9.7). The σ_θ inflation is model-general: +22% (GM), +18% (MGG), +14% (SR); any dependence summary that is strictly increasing in σ_θ inherits this inflation, so Falconer h² is one affected downstream quantity among many (Corollary B3). (2) Structural fingerprint. In the joint twin survival surface S(t1, t2), misspecification produces systematic dependence errors (ISE 48x that of the recovery model). Conditional twin dependence is inflated at all ages, peaking at age 80 (Δr = 0.048). (3) Specificity. The bias requires an omitted component that is both heritable and mortality-relevant. Three negative controls, a boundary check (ρ = 0), and a two-component recovery refit (σ_θ restored to within -3.2%) establish specificity. ACE decomposition yields C ≈ 0 throughout: the omitted extrinsic component loads onto A (because it is shared 1.0/0.5 in MZ/DZ), so switching summary statistics does not restore identification. (4) Sensitivity and falsifiability. Over an empirically anchored regime (σ_γ ∈ [0.30, 0.65], ρ ∈ [0.20, 0.50]), Falconer bias ranges from +2.8 to +18.9 pp (mean 9 pp). If ρ is sufficiently negative, the bias reverses sign in all three model families (Corollary B4). A full-likelihood robustness check shows that this upward pull is partly structural and partly estimator-specific: in the same misspecified one-component model, ML still inflates σ_θ (+3%), whereas matching only rMZ inflates it much more (+21%). These results do not resolve true intrinsic heritability but establish that Shenhar's 50% estimate carries a structured, model-general upward bias originating in the fitted latent variance σ_θ.
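
The link between the reported scale inflation and variance inflation is plain arithmetic, shown here for concreteness:

```python
# A +22.1% inflation in the frailty scale sigma_theta squares to the
# reported ~+49% inflation in the variance sigma_theta^2.
scale_inflation = 0.221
var_inflation = (1 + scale_inflation) ** 2 - 1
print(f"{var_inflation:.1%}")   # -> 49.1%
```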

4
Dissecting oligogenic and polygenic indirect genetic effects through the lens of neighbor genotypic identity

Sato, Y.; Hamazaki, K.

2026-04-03 genetics 10.64898/2026.03.31.715746 medRxiv
Top 0.1%
0.6%

Individual phenotypes often depend on the genotypes of other individuals within a group. These phenomena are termed indirect genetic effects (IGEs) and have been distinguished from direct genetic effects (DGEs) using quantitative genetic models. Recent studies have utilized high-resolution polymorphism data to enable genomic prediction (GP) and genome-wide association study (GWAS) of IGEs, but unified methods remain limited. Here we integrate polygenic and oligogenic IGEs using a multi-kernel mixed model incorporating two random effects with a single covariance parameter. Underlying this implementation, the Ising model of ferromagnetism enabled us to simplify locus-wise and background IGEs for GWAS and GP, respectively. Our simulations demonstrated that, while the previous and present models exhibited similar performance, the present model can infer a trade-off between DGEs and IGEs. By applying this method to three species of woody plants, we found evidence for intergenotypic competition in aspen and apple trees, but limited evidence in climbing grapevines. Based on GWAS, we also detected significant variants associated with competitive IGEs on apple trunk growth. Our study offers a flexible implementation for GWAS/GP of IGEs, thereby providing an effective tool to dissect the genetic architecture of group performance.

5
Explaining temporally clustered errors with an autocorrelated Drift Diffusion Model

Vloeberghs, R.; Tuerlinckx, F.; Urai, A. E.; Desender, K.

2026-03-23 neuroscience 10.64898/2026.03.20.713186 medRxiv
Top 0.2%
0.4%

A widely used framework for studying the computational mechanisms of decision making is the Drift Diffusion Model (DDM). To account for the presence of both fast and slow errors in empirical data, the DDM incorporates across-trial variability in parameters such as the drift rate and the starting point. Although these variability parameters enable the model to reproduce both fast and slow errors, they rely on the assumption that, over trials, each parameter is independently sampled. As a result, the DDM effectively predicts that errors--whether fast or slow--occur randomly over time. However, in empirical data this assumption is violated, as error responses are often temporally clustered. To address this limitation, we introduce the autocorrelated DDM, in which trial-to-trial fluctuations in drift rate, starting point, and boundary evolve according to first-order autoregressive (AR1) processes. Using simulations, we demonstrate that, unlike the across-trial variability DDM, the autocorrelated DDM naturally accounts for temporal clustering of errors. We further show that model parameters can be reliably recovered using Amortized Bayesian Inference, even with as few as 500 trials. Finally, fits to empirical data indicate that the autocorrelated DDM provides the best account of error clustering, highlighting that computational parameters fluctuate over time, despite typically being estimated as fixed across trials.
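
A minimal sketch of the key modification, letting only the drift rate follow an AR(1) process across trials; the paper also lets the starting point and boundary fluctuate and fits with Amortized Bayesian Inference rather than this toy Euler simulation, and all parameter values here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_ar1_ddm(n_trials=500, v_mean=1.0, phi=0.8, sd=0.3,
                     a=1.0, z=0.5, dt=0.001, noise=1.0):
    """DDM with the trial-wise drift rate following an AR(1) process.
    Returns choices (1 = upper boundary, 0 = lower) and response times."""
    choices, rts = np.zeros(n_trials, int), np.zeros(n_trials)
    v = v_mean
    for t in range(n_trials):
        # AR(1) fluctuation of the drift around its mean
        v = v_mean + phi * (v - v_mean) + rng.normal(0, sd)
        x, rt = z * a, 0.0
        while 0 < x < a:                  # diffuse until a boundary is hit
            x += v * dt + noise * np.sqrt(dt) * rng.normal()
            rt += dt
        choices[t], rts[t] = int(x >= a), rt
    return choices, rts

choices, rts = simulate_ar1_ddm()
# For v_mean > 0, lower-boundary hits are errors; under AR(1) drift they
# cluster in time, unlike under independently sampled trial variability.
```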

6
NLCD: A method to discover nonlinear causal relations among genes

Easwar, A.; Narayanan, M.

2026-03-23 bioinformatics 10.64898/2026.03.20.713150 medRxiv
Top 0.2%
0.4%

Distinguishing correlation from causation is a fundamental challenge in many scientific fields, including biology, especially when interventions like randomized controlled trials are infeasible and only observational data are available. Methods based on statistical tests of conditional independence within the Mendelian Randomization framework can detect causality between two observed variables that are each associated with a third instrumental variable. However, these methods for detecting causal relationships between traits (e.g., two gene expression or clinical traits associated with a genetic variant, all observed in the same population) often assume a linear relationship, thereby hindering the discovery of causal gene networks from genomics data. We have developed NLCD, a method for NonLinear Causal Discovery from genomics data based on nonlinear regression modeling and conditional feature importance scoring. NLCD uses these techniques to extend the statistical tests in an existing linear causal discovery method called the Causal Inference Test (CIT). We benchmarked NLCD against current state-of-the-art methods: CIT, Findr, and MRPC. On simulated datasets, NLCD performs comparably to most methods in detecting linear relations (Average AUPRC (Area Under the Precision-Recall Curve) of NLCD=0.94, CIT=0.94, Findr=0.94, and MRPC=0.99), and outperforms them in detecting nonlinear (sine and sawtooth type) relations between two genes (Average AUPRC of NLCD=0.76, CIT=0.60, Findr=0.56, and MRPC=0.73). When tested on a nonlinear subset of a yeast genomic dataset to recover known causal relations involving transcription factors, NLCD and CIT performed comparably to each other and slightly better than Findr and MRPC (Average AUPRC of NLCD=0.82, CIT=0.81, Findr=0.71, and MRPC=0.54). On application to a human genomic dataset, NLCD revealed active causal gene pairs (IRF1 → PSME1 and HLA-C → HLA-T) in the muscle tissue, and clarified the promises and challenges in discovering causal gene networks in tissues under in vivo human settings. Availability: The code implementing our method is available at: https://github.com/BIRDSgroup/NLCD.

7
MiCBuS: Marker Gene Mining for Unknown Cell Types Using Bulk and Single Cell RNA-Seq Data

Zhang, S.; Lu, Y.; Luo, Q.; An, L.

2026-03-24 bioinformatics 10.64898/2026.03.20.711946 medRxiv
Top 0.3%
0.3%

Identifying cell type-specific expressed genes (marker genes) is essential for understanding the roles and interactions of cell populations within tissues. To achieve this, traditional differential analysis approaches are often applied to individual cell-type bulk RNA-seq and single-cell RNA-seq data. However, real-world datasets often pose challenges, such as heterogeneous bulk RNA-seq and incomplete scRNA-seq. Heterogeneous bulk RNA-seq amalgamates gene expression profiles from multiple cell types and results in low resolution, while incomplete scRNA-seq does not capture some cell types from the tissue, leading to unknown cell types. Traditional methods fail to identify marker genes for such unknown cell types. MiCBuS addresses this limitation by generating Dirichlet-pseudo-bulk RNA-seq based on bulk and incomplete single-cell RNA-seq data. By performing differential analysis of gene expression on bulk and Dirichlet-pseudo-bulk RNA-seq samples, MiCBuS can identify the marker genes of unknown cell types, enabling the identification and characterization of these elusive cellular components. Simulation studies and real data analyses demonstrate that MiCBuS reliably and robustly identifies marker genes specific to unknown cell types, a capability that traditional differential analysis methods cannot achieve. Availability and implementation: MiCBuS is implemented in the R language and freely available at https://github.com/Shanshan-Zhang/MiCBuS.
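
A minimal sketch of the Dirichlet-pseudo-bulk construction described above, under the simplifying assumption that the incomplete scRNA-seq reference is summarized by per-cell-type mean profiles; all names and distributions are illustrative, not MiCBuS's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def dirichlet_pseudo_bulk(profiles, alpha, n_samples=100):
    """Mix per-cell-type mean expression profiles (genes x types, from the
    incomplete scRNA-seq reference) with Dirichlet-sampled proportions to
    build pseudo-bulk samples (genes x n_samples)."""
    props = rng.dirichlet(alpha, size=n_samples)   # n_samples x types
    return profiles @ props.T

# Toy reference with 4 known cell types over 1000 genes:
profiles = rng.gamma(2.0, 1.0, size=(1000, 4))
pseudo = dirichlet_pseudo_bulk(profiles, alpha=np.ones(4))
# Differential analysis of real bulk vs. these pseudo-bulk samples can then
# flag genes whose bulk signal is unexplained by the known types, i.e.
# candidate markers for an unknown cell type.
```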

8
Multi-trait colocalisation using MystraColoc: improved performance, deeper insights

Iotchkova, V.; Weale, M. E.

2026-04-01 genomics 10.64898/2026.03.30.715409 medRxiv
Top 0.3%
0.3%

Multi-trait colocalisation is a vital tool to make sense of the large amounts of GWAS data available on platforms like Mystra. It identifies genetic association signals that cluster together, allowing us to infer which gene might be causal for a trait and also which constellation of biological effects might be affected by modulating that gene. Multi-trait colocalisation is a challenging computational problem. Here, we introduce MystraColoc, a Bayesian algorithm for multi-trait colocalisation that works across hundreds or even thousands of GWAS datasets. We illustrate its power both via a worked example at the HDAC9-TWIST1 locus, and via a simulation study that demonstrates its superior clustering performance compared to alternative methods.

9
CardamomOT: a mechanistic optimal transport-based framework for gene regulatory network inference, trajectory reconstruction and generative modeling

Mauge, Y.; Ventre, E.

2026-04-02 bioinformatics 10.64898/2026.03.31.715390 medRxiv
Top 0.3%
0.3%

A key challenge in inferring gene regulatory networks (GRNs) governing cellular processes such as differentiation and reprogramming from experimental data lies in the impossibility of directly measuring protein dynamics at the single-cell level, which prevents establishing causal relationships between regulator activity and target responses. In earlier work, we introduced CARDAMOM, an algorithm that uses temporal snapshots of scRNA-seq data to calibrate a GRN-driven mechanistic model of gene expression. However, this method had several limitations: it could only rely on the relative ordering of time points rather than their exact labels, imposed restrictive quasi-stationary assumptions on protein dynamics, and depended on multiple hyperparameters. Here, we present CardamomOT, a new method based on the same mechanistic model that jointly reconstructs the GRN and unobserved protein trajectories from the data within a mechanistic optimal transport framework. By incorporating exact time labels and priors on protein kinetic rates from the literature, and substantially reducing the number of required hyperparameters, our approach addresses these limitations and substantially improves the accuracy and robustness of GRN calibration. We validate our framework on both in silico and experimental datasets, demonstrating computational scalability and consistently improved performance over state-of-the-art methods in both GRN and trajectory reconstruction. In particular, CardamomOT accurately recovers velocity fields driving cellular trajectories and unobserved protein levels, alongside reliable GRN structures. We also show that these improvements make the calibrated mechanistic model suitable to be used as a generative model to predict cellular responses to unseen perturbations. To our knowledge, this is among the first methods to explicitly integrate mechanistic GRN inference, trajectory reconstruction, and simulation of realistic datasets into a unified framework for scRNA-seq time series analysis.
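
The mechanistic model and protein-trajectory inference are specific to CardamomOT, but the underlying optimal-transport step of linking cells across temporal snapshots can be sketched with a plain Sinkhorn iteration; this vanilla entropic OT is an illustrative stand-in, not the paper's mechanistic coupling.

```python
import numpy as np

def sinkhorn_coupling(X0, X1, eps=0.05, n_iter=200):
    """Entropic OT coupling between two cell snapshots (cells x genes)."""
    C = ((X0[:, None, :] - X1[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
    C = C / C.mean()                       # scale-free cost
    K = np.exp(-C / eps)
    a = np.full(len(X0), 1.0 / len(X0))    # uniform source marginal
    b = np.full(len(X1), 1.0 / len(X1))    # uniform target marginal
    v = np.ones(len(X1))
    for _ in range(n_iter):                # Sinkhorn fixed-point updates
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]     # coupling: rows sum to a, cols to b

rng = np.random.default_rng(2)
P = sinkhorn_coupling(rng.normal(size=(30, 5)), rng.normal(size=(40, 5)))
print(P.sum())   # ~1.0: a joint distribution linking earlier to later cells
```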

10
Causal estimands and target trials for the effect of lag time to treatment of cancer patients

Goncalves, B. P.; Franco, E. L.

2026-04-08 epidemiology 10.64898/2026.04.07.26350338 medRxiv
Top 0.3%
0.3%

Timeliness of therapy initiation is a fundamental determinant of outcomes for many medical conditions, most importantly, cancer. Yet, existing inefficiencies in healthcare systems mean that delays between diagnosis and treatment frequently adversely affect the clinical outcome for cancer patients. Although estimates of effects of lag time to therapy would be informative to policymakers considering resource allocation to minimize delays in oncology, causal methods are seldom explicitly discussed in epidemiologic analyses of these lag times. Here, we propose causal estimands for such studies, and outline the protocol of a target trial that could be emulated with observational data on lag times. To illustrate the application of this approach, we simulate studies of lag time to treatment under two scenarios: one in which indication bias (Waiting Time Paradox) is present and another in which it is absent. Although our discussion focuses on oncologic outcomes, components of the proposed target trial could be adapted to study delays for other medical conditions. We believe that the clarity with which causal questions are posed under the target trial emulation framework would lead to improved quantification of the effects of lag times in oncology, and hence to better informed policy decisions.
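
To make the indication-bias scenario concrete, here is a toy simulation, with all functional forms and numbers invented for illustration, in which lag time has no causal effect yet short lags look harmful, i.e., the Waiting Time Paradox described above.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Sicker patients are treated sooner, and mortality depends on severity only:
severity = rng.normal(size=n)
lag_days = rng.exponential(30 * np.exp(-severity))              # sicker -> faster care
death_1yr = rng.random(n) < 1 / (1 + np.exp(-(severity - 1)))   # no lag effect

short = lag_days < np.median(lag_days)
print(death_1yr[short].mean(), death_1yr[~short].mean())
# Short-lag patients die more often despite lag having no causal effect;
# a target-trial emulation must adjust for severity at diagnosis.
```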

11
Robust Random Forests for Genomic Prediction: Challenges and Remedies

Lourenco, V. M.; Ogutu, J. O.; Piepho, H.-P.

2026-04-01 bioinformatics 10.64898/2026.03.30.715203 medRxiv
Top 0.4%
0.3%

Data contamination--from recording errors to extreme outliers--can compromise statistical models by biasing predictions, inflating prediction errors, and, in severe cases, destabilizing performance in high-dimensional settings. Although contamination can affect responses and covariates, we focus on response contamination and evaluate Random Forests through simulation. Using a synthetic animal-breeding dataset, we assess robust Random Forests across several contamination scenarios and validate them on plant and animal datasets. We thereby clarify the consequences of contamination for prediction, develop a robust Random Forest framework, and evaluate its performance. We examine preprocessing or data-transformation strategies, algorithmic modifications, and hybrid approaches for robustifying Random Forests. Across these approaches, data transformation emerges as the most effective strategy, delivering the strongest performance under contamination. This strategy is simple, general, and transferable to other Machine Learning methods, offering a remedy for robust genomic prediction. In real breeding data, robust Random Forests are useful when substantial contamination, phenotypic corruption, misrecording, or train-deployment mismatch is plausible and the goal is to recover a latent signal for genomic prediction and selection; ranking-based robust Random Forests are the dependable first option, whereas weighting-based Random Forests should be used only when their weighting scheme preserves rank structure and improves prediction. Robustification is not universally necessary, but it becomes important when contamination distorts the link between observed responses and the predictive target; standard Random Forests remain the default for clean data, whereas robust Random Forests should be fitted alongside them whenever contamination is plausible, with the final choice guided by data, trait, and breeding objective. Author summary: Machine learning (ML) methods are widely used for prediction with high-dimensional, complex data, and supervised approaches such as Random Forests (RF) have proved effective for genomic prediction (GP) and selection. Yet their performance can be severely compromised by data contamination if the algorithms rely on classical data-driven procedures that are sensitive to atypical observations. Robustifying ML methods is therefore important both for improving predictive performance under contamination and for guiding their practical use in high-dimensional prediction problems. To address this need, we develop robust preprocessing, algorithm-level, and hybrid strategies for improving RF performance with contaminated data. Using simulated animal data, we show that ranking- and weighting-based robust RF provide the strongest overall compromise for genomic prediction and selection under contamination. Validation on several plant and animal breeding datasets further shows that the benefits of robustification are not universal, but depend on the dataset, trait, and breeding objective. Although motivated by RF, the framework we propose is general, practical, and readily transferable to other ML methods. It also offers a basis for deciding when robustness should complement standard RF rather than replace it outright.
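
A minimal sketch of the transformation strategy the abstract favors: rank-transform a contaminated response before fitting an off-the-shelf Random Forest. This is illustrative only; the paper's ranking-based robust RF and its evaluation protocol are more involved.

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)

# Simulated markers and a latent genetic signal, with 5% gross contamination:
X = rng.normal(size=(400, 50))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=400)
y[rng.choice(400, 20, replace=False)] += 50     # extreme response outliers

# Rank-transform the response so outliers cannot dominate the splits:
y_rank = rankdata(y) / len(y)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y_rank)
# Predictions are on the rank scale; for genomic selection, ranking
# candidates is often all that is needed, which is why this simple
# transformation transfers well to other ML methods.
```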

12
Cellector: A tool to detect foreign genotype cells in scRNAseq data with applications in leukemia and microchimerism.

Heaton, H.; Behboudi, R.; Ward, C.; Weerakoon, M.; Kanaan, S.; Reichle, S.; Hunter, N.; Furlan, S.

2026-03-30 bioinformatics 10.64898/2026.03.26.714571 medRxiv
Top 0.4%
0.2%

Rare, genetically distinct cells occur in various samples, such as those from transplant patients, naturally occurring microchimerism between maternal and fetal tissues, and cancer samples with sufficient mutational burden. Computational methods for detecting these foreign cells are vital to studying these biological conditions. An application of particular interest is that of leukemia patients post hematopoietic cell transplant (HCT). In many leukemias, a primary therapy is HCT, after which the primary genotype of the bone marrow and blood cells should be of donor origin. If cells exist that are of the patient's genotype and the cell type lineage of the particular leukemia, this is known as measurable residual disease (MRD). If the MRD is high enough, this may represent a relapse of the patient's leukemia. Furthermore, accurately estimating the MRD is important for driving clinical decision making for these patients. Here we present Cellector, a computational method for identifying rare foreign genotype cells in single cell RNAseq (scRNAseq) datasets. We show that Cellector accurately detects microchimeric cells down to an exceedingly low percentage of these cells present (0.05% or lower).

13
Why Invariant Risk Minimization Fails on Tabular Data: A Gradient Variance Solution

Mboya, G. O.

2026-04-13 epidemiology 10.64898/2026.04.09.26350513 medRxiv
Top 0.4%
0.2%

Machine learning models trained on observational data from one environment frequently fail when deployed in another, because standard learning algorithms exploit spurious correlations alongside causal ones. Invariant learning methods address this problem by seeking representations that support stable prediction across training environments, but their behavior on tabular data remains poorly characterized. We present CausTab, a gradient variance regularization framework for causal invariant representation learning on mixed tabular data. CausTab penalizes the variance of parameter gradients across training environments, providing a richer invariance signal than the scalar penalty used by Invariant Risk Minimization (IRM). We provide formal results showing that the gradient variance penalty is zero at causally invariant solutions and positive at solutions that rely on spurious features. Through experiments on synthetic data across three spurious-correlation regimes, four cycles of the National Health and Nutrition Examination Survey (NHANES), and four hospital systems in the UCI Heart Disease dataset, we demonstrate that: (1) IRM consistently degrades relative to standard empirical risk minimization (ERM) on tabular data, losing up to 13.8 AUC points in spurious-dominant settings, a failure we trace mechanistically to penalty collapse during training; (2) CausTab matches or exceeds ERM in every experimental condition; (3) CausTab achieves consistently better probability calibration than both ERM and IRM; and (4) invariant learning methods fail when environments differ in outcome prevalence rather than in spurious feature correlations, a boundary condition we characterize both empirically and theoretically. We introduce the Spurious Dominance Index (SDI), a practical scalar diagnostic for determining whether a dataset requires invariant learning, and validate it across all experimental settings.
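
The penalty admits a compact sketch: penalize the across-environment variance of parameter gradients. The exact CausTab formulation is not given in this listing, so the PyTorch fragment below is an assumption-laden illustration, not the authors' implementation.

```python
import torch

def gradient_variance_penalty(model, loss_fn, env_batches):
    """Variance of per-environment parameter gradients: zero when every
    training environment induces the same gradient (as expected at a
    causally invariant solution), positive under spurious reliance."""
    env_grads = []
    for X, y in env_batches:
        loss = loss_fn(model(X), y)
        grads = torch.autograd.grad(loss, tuple(model.parameters()),
                                    create_graph=True)
        env_grads.append(torch.cat([g.reshape(-1) for g in grads]))
    G = torch.stack(env_grads)                  # n_envs x n_params
    return G.var(dim=0, unbiased=False).sum()

# Usage in a training step (lam is a regularization weight):
#   total = erm_loss + lam * gradient_variance_penalty(model, loss_fn, envs)
```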

14
Learning gene interactions from tabular gene expression data using Graph Neural Networks

Boulougouri, M.; Nallapareddy, M. V.; Vandergheynst, P.

2026-03-23 bioinformatics 10.64898/2026.03.19.712949 medRxiv
Top 0.4%
0.2%

Gene interactions form complex networks underlying disease susceptibility and therapeutic response. While bulk transcriptomic datasets offer rich resources for studying these interactions, applying Graph Neural Networks (GNNs) to such data remains limited by a lack of methodological guidance, especially for constructing gene interaction graphs. We present REGEN (REconstruction of GEne Networks), a GNN-based framework that simultaneously learns latent gene interaction networks from bulk transcriptomic profiles and predicts patient vital status. Evaluated across seven cancer types in the TCGA cohort, REGEN outperforms baseline models in five datasets and provides robust network inference. By systematically comparing strategies for initializing gene-gene adjacency matrices, we derive practical guidelines for GNN application to bulk transcriptomics. Analysis of the learned kidney cancer gene network reveals cancer-related pathways and biomarkers, validating the model's biological relevance. Together, we establish a principled approach for applying GNNs to bulk transcriptomics, enabling improved phenotype prediction and meaningful gene network discovery.

15
Deriving LD-adjusted GWAS summary statistics through linkage disequilibrium deconvolution

Nouira, A.; Favre Moiron, M.; Tournaire, M.; Verbanck, M.

2026-04-11 genetic and genomic medicine 10.64898/2026.04.10.26350574 medRxiv
Top 0.6%
0.2%

Genome-wide association studies (GWAS) have identified numerous genetic variants associated with complex traits. However, linkage disequilibrium (LD) confounds these associations, leading to false positives where non-causal variants appear associated because they are correlated with nearby causal variants. This is particularly the case in highly polygenic traits, where the genome can be saturated with causal variants. To address this issue, we propose LDeconv, a method based on truncated singular value decomposition (SVD) that adjusts GWAS summary statistics without requiring individual-level genotype data. This approach accounts for LD structure, isolates causal variants in high-LD regions, and improves the reliability of effect size estimates. We assess its performance through simulations across various LD scenarios, conduct extensive sensitivity analyses, and apply it to real GWAS data from the UK Biobank. Our results demonstrate that LDeconv effectively reduces false discoveries while preserving true associations, offering a robust framework for post-GWAS analysis.
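
A minimal sketch of the idea, under the standard approximation that marginal z-scores are (approximately) the LD matrix times joint causal effects, so a truncated-SVD pseudo-inverse deconvolves them; the function name, rank choice, and toy AR(1)-style LD block are illustrative assumptions, not LDeconv itself.

```python
import numpy as np

def ld_adjust(z, R, k=50):
    """Deconvolve LD from GWAS z-scores via a rank-k pseudo-inverse of the
    LD correlation matrix R (z ~ R z_causal, so z_adj = R^+ z); the
    truncation regularizes near-singular LD directions."""
    U, s, Vt = np.linalg.svd(R)
    inv = Vt[:k].T @ np.diag(1.0 / s[:k]) @ U[:, :k].T
    return inv @ z

# Toy block of 100 SNPs with decaying LD and a single causal variant:
p = 100
R = 0.9 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
z_causal = np.zeros(p); z_causal[40] = 8.0
z_obs = R @ z_causal                 # LD smears the signal over neighbors
z_adj = ld_adjust(z_obs, R, k=30)
print(np.argmax(np.abs(z_obs)), np.argmax(np.abs(z_adj)))
# Both peak at SNP 40, but z_adj concentrates the signal back at the
# causal variant instead of its LD neighbors.
```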

16
10-minimizers: a promising class of constant-space minimizers

Shur, A.; Tziony, I.; Orenstein, Y.

2026-03-18 bioinformatics 10.64898/2026.03.16.712052 medRxiv
Top 0.6%
0.2%

Minimizers are sampling schemes which are ubiquitous in almost any high-throughput sequencing analysis. Assuming a fixed alphabet of size σ, a minimizer is defined by two positive integers k, w and a linear order ρ on k-mers. A sequence is processed by a sliding window algorithm that chooses in each window of length w + k - 1 its minimal k-mer with respect to ρ. A key characteristic of a minimizer is its density, which is the expected frequency of chosen k-mers among all k-mers in a random infinite σ-ary sequence. Minimizers of smaller density are preferred as they produce smaller samples, which lead to reduced runtime and memory usage in downstream applications. Recent studies developed methods to generate minimizers with optimal and near-optimal densities, but they require explicitly storing k-mer ranks in Ω(2^k) space. While constant-space minimizers exist, and some of them are proven to be asymptotically optimal, no constant-space minimizer was proven to guarantee lower density compared to a random minimizer in the non-asymptotic regime, and many minimizer schemes suffer from long k-mer key-retrieval times due to complex computation. In this paper, we introduce 10-minimizers, which constitute a class of minimizers with promising properties. First, we prove that for every k > 1 and every w ≥ k - 2, a random 10-minimizer has, on expectation, lower density than a random minimizer. This is the first provable guarantee for a class of minimizers in the non-asymptotic regime. Second, we present spacers, which are particular 10-minimizers combining three desirable properties: they are constant-space, low-density, and have small k-mer key-retrieval time. In terms of density, spacers are competitive with the best known constant-space minimizers; in certain (k, w) regimes they achieve the lowest density among all known (not necessarily constant-space) minimizers. Notably, we are the first to benchmark constant-space minimizers in the time spent for k-mer key retrieval, which is the most fundamental operation in many minimizers-based methods. Our empirical results show that spacers can retrieve k-mer keys in competitive time (a few seconds per genome-size sequence, which is less than required by random minimizers), for all practical values of k and w. We expect 10-minimizers to improve minimizers-based methods, especially those using large window sizes. We also propose the k-mer key-retrieval benchmark as a standard objective for any new minimizer scheme.
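
For readers new to minimizers, a short sketch of the scheme and its density, the quantity the paper optimizes. The hash-based random order below is the baseline that 10-minimizers provably beat, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(5)

def minimizer_density(seq, k, w, rank):
    """Empirical density of a minimizer scheme: the fraction of k-mers
    selected when each window of w consecutive k-mers (i.e., w + k - 1
    characters) keeps its rank-minimal k-mer."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for i in range(len(kmers) - w + 1):
        window = range(i, i + w)
        selected.add(min(window, key=lambda j: rank(kmers[j])))  # position kept
    return len(selected) / len(kmers)

seq = "".join(rng.choice(list("ACGT"), size=100_000))
# A random minimizer: hash each k-mer to a pseudo-random rank.
print(minimizer_density(seq, k=8, w=10, rank=lambda s: hash(s)))
# Expect roughly 2/(w+1) ~ 0.18 for a random order; schemes such as the
# 10-minimizers above aim provably below this without storing ranks.
```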

17
Locat: Joint enrichment and depletion testing identifies localized marker genes in single-cell transcriptomics

Lewis, W. R.; Aizenbud, Y.; Strino, F.; Kluger, Y.; Parisi, F.

2026-04-07 bioinformatics 10.64898/2026.04.03.716370 medRxiv
Top 0.6%
0.2%

Several methods have been developed to identify marker genes that delineate cell populations in single-cell transcriptomic data, yet most emphasize enrichment within candidate populations without testing whether expression is significantly reduced outside those populations. We present Locat, a framework for identifying highly specific localized genes by testing whether expression is concentrated within compact regions of the cellular embedding and depleted elsewhere. For each gene, Locat fits weighted Gaussian mixture models to gene-specific and background densities, computes test statistics for concentration within compact regions and depletion outside those regions, and integrates the results into a unified localization score. Across synthetic benchmarks with controlled ground truth, Locat detects localized genes spanning uni-modal, multi-modal, and sparse expression patterns, and appropriately loses significance when simulated expression becomes indistinguishable from background structure. In biological datasets spanning developmental, perturbation, and differentiation contexts, Locat identifies compact marker sets that capture lineage organization, condition-specific programs, and temporal regulatory dynamics. Localized gene sets are often smaller than conventional feature selections such as highly variable genes, and embeddings constructed from localized gene sets tend to preserve separation of major cell populations and developmental programs. In murine dermis, embeddings computed using localized genes preserve differentiation and cell-cycle trajectories observed in the full dataset. In interferon-β-treated PBMCs, independent localization analysis of control and stimulated samples reveals stimulus-responsive programs and markers of shared immune populations without requiring batch correction or data integration. In retinoic acid-induced embryonic stem cell differentiation, localized genes exhibit reproducible stage-specific patterns across time points. Together, these results demonstrate that jointly assessing concentration and depletion yields specific, interpretable marker genes that enable direct cross-condition and multi-sample comparisons of marker genes across diverse biological settings.

18
Horse, not zebra: accounting for lineage abundance in maximum likelihood phylogenetics

De Maio, N.

2026-03-27 bioinformatics 10.64898/2026.03.25.714173 medRxiv
Top 0.6%
0.2%

Maximum likelihood phylogenetic methods are popular approaches for estimating evolutionary histories. These methods do not assume prior hypotheses regarding the shape of the phylogenetic tree, and this lack of prior assumptions can be useful, particularly in the case of idiosyncratic sampling patterns. For example, the rate at which species are sequenced can differ widely between lineages, with lineages of greater interest to humans usually being sequenced more often than others. However, in some settings sampling can be lineage-agnostic. In genomic epidemiology, for example, the sequencing rate can change through time or across locations, but is often agnostic to the specific pathogen strain being sequenced. In this scenario, one expects that the abundance of a pathogen strain at a specific time and location in the host population is reflected in the relative abundance of that strain among the genomes sequenced at that time and location. Here, I show that this simple assumption, when appropriate and incorporated within maximum likelihood phylogenetics, can greatly improve the accuracy of phylogenetic inference. This is similar to the famous medical principle "when you hear hoofbeats, think of horses, not zebras". In our application this means that, when, for example, observing a (possibly incomplete) genome sequence that has a similar likelihood of belonging to multiple different strains, I aim to prioritize phylogenetic placement onto a common strain (the "horse", a common disease) rather than a rare one (the "zebra", a rare disease). I introduce and assess two separate approaches to achieve this. The first approach rescales the likelihood of a phylogenetic tree by the number of distinct binary topologies obtainable by arbitrarily resolving multifurcations in the tree. This approach is based on a new interpretation of multifurcating phylogenetic trees particularly relevant at low divergence: multifurcations represent a lack of signal for resolving the bifurcating topology rather than an instantaneous multifurcating event, and so a multifurcating tree is interpreted as the set of bifurcating trees consistent with the multifurcating one, rather than as a single multifurcating topology. The second approach instead includes a tree prior that assumes that genomes are sequenced at a rate proportional to their abundance. Both approaches favor phylogenetic placement at abundant lineages, and using simulations I show that both methods dramatically improve the accuracy of phylogenetic inference in scenarios like SARS-CoV-2 phylogenetics, where large multifurcations are common. This considerable impact is also observed in real pandemic-scale SARS-CoV-2 genome data, where accounting for lineage prevalence reduces phylogenetic uncertainty by around one order of magnitude. Both approaches were implemented as part of the free and open source phylogenetic software MAPLE v0.7.5.4 (https://github.com/NicolaDM/MAPLE).
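
The first approach's rescaling factor has a closed form worth stating: a multifurcation with c children admits (2c - 3)!! distinct rooted binary resolutions, and the factors multiply across multifurcating nodes. A small sketch follows; the exact bookkeeping inside MAPLE may differ.

```python
def double_factorial(n):
    """n!! = n * (n - 2) * (n - 4) * ... down to 1 or 2."""
    out = 1
    while n > 1:
        out *= n
        n -= 2
    return out

def n_binary_resolutions(child_counts):
    """Number of distinct binary topologies obtained by resolving each
    multifurcation: the product over nodes of (2c - 3)!! for c children."""
    total = 1
    for c in child_counts:
        total *= double_factorial(2 * c - 3)
    return total

# A tree with one 4-way and one 3-way multifurcation:
print(n_binary_resolutions([4, 3]))   # 15 * 3 = 45 binary resolutions
# Adding log(45) to that tree's log-likelihood favors placing new genomes
# on abundant, poorly resolved (multifurcating) lineages: the "horse".
```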

19
SEVA: An externally driven framework for reproducing COVID-19 mortality waves without transmission feedback

Varming, K.

2026-03-18 epidemiology 10.64898/2026.01.30.26345245 medRxiv
Top 0.6%
0.2%

Understanding the dynamical mechanisms underlying epidemic wave formation remains a central problem in mathematical epidemiology. Population-level epidemic waves are commonly interpreted as emergent consequences of nonlinear transmission feedback between susceptible and infectious individuals. However, epidemic time series from different regions often display markedly different waveform regimes, ranging from sharply peaked epidemics with rapid post-peak decline to more prolonged plateau-like dynamics. Here we propose the SEVA (Seasonal/Environmental Viral Activity) framework as a parsimonious alternative dynamical interpretation of epidemic wave formation. In this formulation, epidemic waveforms arise from depletion of a finite vulnerable population under a temporally structured viral activity field. The activity function is represented by a monotonic logistic hazard describing the temporal evolution of viral activity. With activation timing and steepness held constant across regions, daily incidence emerges as the product of activity intensity and the remaining vulnerable population. The framework is applied to first-wave COVID-19 hospitalization and mortality data from selected European countries and U.S. states during spring 2020. With fixed activation parameters and region-specific activity intensity, the model provides a simple dynamical explanation for diverse epidemic waveform regimes--including sharply peaked waves and plateau-like dynamics--without modification of the underlying dynamical structure. When epidemic trajectories are expressed in normalized form, curves from regions with very different mortality burdens display closely similar temporal structures. Within the SEVA formulation, this behaviour arises naturally from the interaction between a common temporal activation profile and regionally varying activity intensity. In this perspective, sharply peaked epidemics and plateau-like trajectories represent different dynamical regimes of the same activity-driven depletion process.
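
A minimal sketch of the SEVA mechanism as described above: a shared logistic activity profile depleting a finite vulnerable pool, with only the intensity varied to move between peaked and plateau-like waveforms. All parameter values are illustrative assumptions.

```python
import numpy as np

def seva_incidence(days, V0, A_max, t0=60.0, s=0.12):
    """Logistic viral-activity hazard a(t) depleting a vulnerable pool V,
    with daily incidence a(t) * V(t). Activation midpoint t0 and steepness
    s are held fixed across regions; only intensity A_max and V0 vary."""
    V, inc = V0, []
    for t in days:
        a = A_max / (1 + np.exp(-s * (t - t0)))   # monotone activity ramp
        new = a * V
        inc.append(new)
        V -= new
    return np.array(inc)

days = np.arange(150)
sharp = seva_incidence(days, V0=10_000, A_max=0.20)    # fast depletion: peaked wave
plateau = seva_incidence(days, V0=10_000, A_max=0.02)  # slow depletion: plateau
print(days[np.argmax(sharp)], days[np.argmax(plateau)])
# Same activation profile, different intensities: two waveform regimes of
# one activity-driven depletion process, as the abstract argues.
```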

20
scRGCL: Neighbor-Aware Graph Contrastive Learning for Robust Single-Cell Clustering

Fan, J.; Liu, F.; Lai, X.

2026-03-18 bioinformatics 10.64898/2026.03.16.712039 medRxiv
Top 0.6%
0.2%

Accurate cell type identification is a fundamental step in single-cell RNA sequencing (scRNA-seq) data analysis, providing critical insights into cellular heterogeneity at high resolution. However, the high dimensionality, zero inflation, and long-tailed distribution of scRNA-seq data pose significant computational challenges for conventional clustering approaches. Although recent deep learning-based methods utilize contrastive learning to jointly learn representations and clustering assignments, they often overlook cluster-level information, leading to suboptimal feature extraction for downstream tasks. To address these limitations, we propose scRGCL, a single-cell clustering method that learns a regularized representation guided by contrastive learning. Specifically, scRGCL captures the cell-type-associated expression structure by clustering similar cells together while ensuring consistency. For each sample, the model performs negative sampling by selecting cells from distinct clusters, thereby ensuring semantic dissimilarity between the target cell and its negative pairs. Moreover, scRGCL introduces a neighbor-aware re-weighting strategy that increases the contribution of samples from clusters closely related to the target. This mechanism prevents cells from the same category from being mistakenly pushed apart, effectively preserving intra-cluster compactness. Extensive experiments on fourteen public datasets demonstrate that scRGCL consistently outperforms state-of-the-art methods, as evidenced by significant improvements in normalized mutual information (NMI) and adjusted rand index (ARI). Moreover, ablation studies confirm that the integration of cluster-aware negative sampling and the neighbor-aware re-weighting module is essential for achieving high-fidelity clustering. By harmonizing cell-level contrast with cluster-level guidance, scRGCL provides a robust and scalable framework that advances the precision of automated cell-type discovery in increasingly complex single-cell landscapes. Key Messages: (1) scRGCL uses contrastive learning on a regularized representation for single-cell clustering. (2) scRGCL outperforms four state-of-the-art methods on 15 datasets. (3) scRGCL's cluster-aware negative sampling and neighbor-aware re-weighting modules are essential for high-fidelity single-cell clustering.